Skip to content

feat: implement DuckDB filesystem integration for Vortex file handling#6198

Open
iceboundrock wants to merge 2 commits intovortex-data:developfrom
Lychee-Technology:develop
Open

feat: implement DuckDB filesystem integration for Vortex file handling#6198
iceboundrock wants to merge 2 commits intovortex-data:developfrom
Lychee-Technology:develop

Conversation

@iceboundrock
Copy link

@iceboundrock iceboundrock commented Jan 29, 2026

@joseph-isaacs joseph-isaacs added the action/benchmark Trigger full benchmarks to run on this PR label Jan 29, 2026
@codspeed-hq
Copy link

codspeed-hq bot commented Jan 29, 2026

Merging this PR will not alter performance

✅ 1135 untouched benchmarks
⏩ 1268 skipped benchmarks1


Comparing Lychee-Technology:develop (b2db9de) with develop (6f10740)

Open in CodSpeed

Footnotes

  1. 1268 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@0ax1 0ax1 self-requested a review January 29, 2026 10:40
@0ax1 0ax1 added action/benchmark-sql Trigger SQL benchmarks to run on this PR changelog/feature A new feature and removed action/benchmark Trigger full benchmarks to run on this PR labels Jan 29, 2026
@joseph-isaacs joseph-isaacs added action/benchmark-sql Trigger SQL benchmarks to run on this PR and removed action/benchmark-sql Trigger SQL benchmarks to run on this PR labels Jan 29, 2026
Copy link
Contributor

@0ax1 0ax1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time to look into this. There's a couple of issues around thread safety / locking, error handling and memory leaks.

If you used AI to assist in writing the code for your PR please mention this in your PR description (see: https://github.com/vortex-data/vortex/blob/develop/CONTRIBUTING.md).

@joseph-isaacs joseph-isaacs added action/benchmark-sql Trigger SQL benchmarks to run on this PR and removed action/benchmark-sql Trigger SQL benchmarks to run on this PR labels Feb 2, 2026
Copy link
Contributor

@0ax1 0ax1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there, thanks for taking the time to address the comments!

@ruoshili ruoshili force-pushed the develop branch 2 times, most recently from 66fedc0 to b6e15a3 Compare February 11, 2026 02:21
@iceboundrock
Copy link
Author

Almost there, thanks for taking the time to address the comments!

Hi @0ax1 , any other comments?

@0ax1
Copy link
Contributor

0ax1 commented Feb 11, 2026

Almost there, thanks for taking the time to address the comments!

Hi @0ax1 , any other comments?

Currently swamped with other work but will get back to it when I find a free moment.

@gatesn
Copy link
Contributor

gatesn commented Feb 13, 2026

@iceboundrock - quick question on this. Do you know if it's possible to grab the config back out from DuckDB's filesystems?

I don't want to block this PR, don't worry(!), but we do have plans to implement a high performance I/O layer at some point and it would probably make more sense to try and configure our own I/O using duckdb configs, rather than attempt to optimize calling back into DuckDB's I/O layer.

@gatesn gatesn added action/benchmark-sql Trigger SQL benchmarks to run on this PR and removed action/benchmark-sql Trigger SQL benchmarks to run on this PR labels Feb 13, 2026
@iceboundrock
Copy link
Author

iceboundrock commented Feb 13, 2026

@iceboundrock - quick question on this. Do you know if it's possible to grab the config back out from DuckDB's filesystems?

I don't want to block this PR, don't worry(!), but we do have plans to implement a high performance I/O layer at some point and it would probably make more sense to try and configure our own I/O using duckdb configs, rather than attempt to optimize calling back into DuckDB's I/O layer.

Could you please elaborate on what config refers to? If it’s S3 credentials, HTTP Proxy configuration, etc., I don’t believe we can read them from DuckDB. They are stored in Secrets Manager and the read API redacts sensitive information.

We probably can do the same as the httpfs extension does. Code link:

@0ax1 0ax1 added action/benchmark-sql Trigger SQL benchmarks to run on this PR and removed action/benchmark-sql Trigger SQL benchmarks to run on this PR labels Feb 16, 2026
@0ax1
Copy link
Contributor

0ax1 commented Feb 16, 2026

@iceboundrock - quick question on this. Do you know if it's possible to grab the config back out from DuckDB's filesystems?
I don't want to block this PR, don't worry(!), but we do have plans to implement a high performance I/O layer at some point and it would probably make more sense to try and configure our own I/O using duckdb configs, rather than attempt to optimize calling back into DuckDB's I/O layer.

Could you please elaborate on what config refers to? If it’s S3 credentials, HTTP Proxy configuration, etc., I don’t believe we can read them from DuckDB. They are stored in Secrets Manager and the read API redacts sensitive information.

We probably can do the same as the httpfs extension does. Code link:

* [1](https://github.com/duckdb/duckdb-httpfs/blob/6afe39d5be688b032146baa97e49601b50c9456e/src/httpfs_extension.cpp#L73-L105)

* [2](https://github.com/duckdb/duckdb-httpfs/blob/main/src/create_secret_functions.cpp)

Let's handle this in a follow up to limit the scope of this PR.

Copy link
Contributor

@0ax1 0ax1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR looks great and CI (including all benchmarks) pass.

Last bit of fine-tuning and this is ready to land.


try {
auto *wrapper = reinterpret_cast<FileHandleWrapper *>(handle);
wrapper->handle->Write(QueryContext(), const_cast<uint8_t *>(buffer), len, offset);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop the const_cast and adjust the signature accordingly.

}
}

// SAFETY: ClientContext is an opaque pointer owned by DuckDB and remains valid for the lifetime of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are actually not needed, as we pass around SendableClientContext.

Drop:

unsafe impl Send for ClientContext {}
unsafe impl Sync for ClientContext {}

let mut out_len: cpp::idx_t = 0;
let offset = self.pos;

let status = unsafe {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use spawn blocking, instead of a blocking write.

vortex_err!("{message}")
}

pub fn duckdb_fs_glob(ctx: &ClientContext, pattern: &str) -> VortexResult<Vec<url::Url>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pub(crate)

}

/// A VortexReadAt implementation backed by DuckDB's filesystem (e.g., httpfs/s3).
pub struct DuckDbFsReader {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pub(crate)

}

impl DuckDbFsReader {
pub unsafe fn open_url(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pub(crate)

unsafe impl Send for DuckDbFsReader {}
unsafe impl Sync for DuckDbFsReader {}

pub struct DuckDbFsWriter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pub(crate)

}

impl DuckDbFsWriter {
pub unsafe fn create(ctx: cpp::duckdb_vx_client_context, path: &str) -> VortexResult<Self> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pub(crate)

Ok(urls)
}

pub unsafe fn duckdb_fs_create_writer(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pub(crate)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

action/benchmark-sql Trigger SQL benchmarks to run on this PR changelog/feature A new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants